Eamon Glackin - Duolingo Director of Marketing Analytics

Take Home Data Analysis Task

This notebook contains a first-pass look at the data provided: some high-level views of the data, the cleaning that will be necessary, and a first analysis of each field using the package pandas_profiling.

The first cell contains all the necessary package imports.

With 6187 rows but only 6150 unique user_ids, there are 37 duplicates (6187 - 6150 = 37). Let's take a quick look at the duplicate users.
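The count above is simply total rows minus unique user_ids. Here is a minimal sketch of that check on a toy frame; the values and the `country` column name are made up for illustration, and only `user_id` comes from the data described here.

```python
import pandas as pd

# Toy stand-in for the survey data (hypothetical values and column names).
survey = pd.DataFrame({
    "user_id": ["a1", "b2", "b2", "c3", "d4", "d4"],
    "country": ["US", "MX", "BR", "FR", "JP", "JP"],
})

n_rows = len(survey)
n_unique = survey["user_id"].nunique()
n_extra = n_rows - n_unique  # rows beyond the first occurrence of each user_id

# keep=False flags every row whose user_id appears more than once,
# so both copies of a duplicated user are shown side by side.
dupes = survey[survey["user_id"].duplicated(keep=False)].sort_values("user_id")

print(n_rows, n_unique, n_extra)
print(dupes)
```

Using `keep=False` is a deliberate choice here: the default `duplicated()` would hide the first occurrence, which makes it harder to compare the conflicting records.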

There is no obvious pattern in the duplicate records as far as I can tell. In some cases, the same user_id shows up with very different responses including different countries, ages, income levels, etc.
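One way to make "very different responses" concrete is to count, for each duplicated user_id, how many distinct values each column takes. This is only a sketch on toy data; the `country` and `age` column names and all values are assumptions, not the real survey schema.

```python
import pandas as pd

# Toy stand-in for the survey data (hypothetical values and column names).
survey = pd.DataFrame({
    "user_id": ["a1", "b2", "b2", "c3", "d4", "d4"],
    "country": ["US", "MX", "BR", "FR", "JP", "JP"],
    "age": ["18-34", "35-54", "18-34", "55+", "18-34", "18-34"],
})

# Restrict to users that appear more than once.
dupes = survey[survey["user_id"].duplicated(keep=False)]

# Number of distinct answers per column within each duplicated user_id.
distinct_per_user = dupes.groupby("user_id")[["country", "age"]].nunique()

# True where a user's duplicate rows disagree on that column.
conflicting = distinct_per_user.gt(1)
conflict_users = conflicting.any(axis=1)

print(conflicting)
```

In this toy frame, the rows for `b2` disagree on both columns while the rows for `d4` are exact repeats, which is the kind of split that matters when deciding whether to drop or keep duplicates.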

Pandas Profiling

Using the package pandas_profiling, I can get a quick and dirty look at the underlying data, including histograms of the distributions of each variable, correlations, data types, etc.
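The profiling report is far richer than anything shown here. As a rough pandas-only approximation of a few of the per-column summaries (types, missing values, distinct counts), something like the sketch below can serve as a sanity check. The frame, its column names, and its values are hypothetical, and this is not the package's output.

```python
import pandas as pd

# Toy frame with a couple of missing values (hypothetical data).
df = pd.DataFrame({
    "user_id": ["a1", "b2", "c3", "d4"],
    "country": ["US", "MX", None, "JP"],
    "age": [25, 41, 33, None],
})

# One row per column: data type, missing count, and distinct non-null values.
overview = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_missing": df.isna().sum(),
    "n_unique": df.nunique(),
})

print(overview)
# Summary statistics for numeric and non-numeric columns together.
print(df.describe(include="all"))
```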

Now that we've examined the survey data, let's look at the app usage data file.

There are 35 duplicates in this file as well.

Let's now generate a profile of the app usage data.